Gender homophily in online book networks
We measure the gender homophily (and other network statistics) of large-scale online book markets, amazon.com and amazon.co.uk, using datasets describing millions of books sold to readers. Large book networks are created from sales (two books are connected if many readers have bought both) and can be used to recommend new books. The networks are analysed by the gender of their first author: is book consumption assortative by gender? Book networks are indeed gender-assortative: readers globally prefer to read from one author gender (the global assortativity coefficient by gender is around 0.4). Although 33% of first authors among all books are female, female-authored books are not proportionally sold together with male-authored books: an average of 20% (and a median of 11%) of the books co-bought with male-authored books are female-authored. Instead, female-authored books make up on average more than half of the books co-bought with other female-authored books. The gender makeup of literary genres and of structural book communities shows that the gender homophily originates in a gender skew not only in certain literary genres (a fact known from prior studies), but even more strongly in certain book communities, with these communities spanning multiple literary genres.
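The global gender assortativity reported above can be computed with off-the-shelf tools. A minimal sketch using networkx, on an invented toy co-purchase network (the node genders and edges are illustrative, not taken from the Amazon data):

```python
import networkx as nx

# Toy co-purchase network: nodes are books, an edge means the two books
# are frequently bought together; 'gender' is the first author's gender.
# All nodes and edges here are invented for illustration.
G = nx.Graph()
G.add_nodes_from([1, 2, 3], gender="F")
G.add_nodes_from([4, 5, 6], gender="M")
G.add_edges_from([(1, 2), (2, 3), (1, 3),   # F-F co-purchases
                  (4, 5), (5, 6), (4, 6),   # M-M co-purchases
                  (3, 4)])                  # one cross-gender co-purchase
r = nx.attribute_assortativity_coefficient(G, "gender")
print(round(r, 2))   # → 0.71
```

With only a single cross-gender edge, the toy network is strongly assortative; an even mix of within- and cross-gender edges would push the coefficient towards 0.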
The semantics of constellation line figures
We answer the question of whether, when forming constellations in the night sky,
people in astronomical cultures around the world consistently imagine and
assign the same symbolism to the same (type of) star cluster. Evidence of
semantic universality has so far been anecdotal. We use two complementary
definitions for a star cluster: (1) a star group in a particular sky region
(regardless of its exact shape), and (2) a star group with a particular shape
and brightness (regardless of its location in the sky). Over a dataset of 1903
constellations from 75 astronomical cultures, we find semantic parallels which
are likely culturally induced: body parts in the sky region delineated by the
International Astronomical Union (IAU) as Ori, fish in Cru and Sco, geometric
symbols in Cru, groups in UMa, mammals in UMa, and reptiles in Sco.
Surprisingly, we find many more significant semantic parallels which can only
be naturally induced by the shape and composition of the star pattern
underlying a constellation (or, are endogenous to the sky rather than
culture-dependent): arthropods in IAU Sco, body parts in Tau, geometric and
group symbols in star clusters (regardless of sky region) with a small number
of bright stars comparable in magnitude, humanoids and mammals naturalistically
drawn in star clusters with large spatial diameter and many stars, landscapes
in IAU Eri, man-made objects of various types in many IAU regions, and reptiles
consistently drawn in star clusters with low aspect ratio or low branching in
the minimum spanning tree drawn over the stars. These naturally induced
semantics show that there are universal (rather than only cultural) thought
patterns behind forming and naming constellations.
Comment: Part 2 of arXiv:2110.12329, published in PLOS ONE 17(7): e0272270 (2022). Shares the same dataset.
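The shape statistics named above (aspect ratio, branching in the minimum spanning tree drawn over the stars) can be sketched as follows; the star coordinates are invented for illustration:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Hypothetical (x, y) star positions for one constellation; the
# coordinates are invented for illustration.
stars = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0],
                  [3.0, 0.2], [4.0, 0.1]])

# Minimum spanning tree over pairwise distances between the stars.
mst = minimum_spanning_tree(squareform(pdist(stars))).toarray()
adj = (mst + mst.T) > 0                 # undirected MST adjacency

# Branching = maximum node degree in the MST (a pure chain gives 2);
# aspect ratio = short side over long side of the bounding box.
branching = int(adj.sum(axis=1).max())
extent = stars.max(axis=0) - stars.min(axis=0)
aspect_ratio = float(extent.min() / extent.max())
print(branching, aspect_ratio)
```

This nearly collinear chain yields low branching and a low aspect ratio, the kind of elongated star pattern the abstract associates with reptile symbolism.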
Top influencers can be identified universally by combining classical centralities
Information flow, opinion, and epidemics spread over structured networks.
When using individual node centrality indicators to predict which nodes will be
among the top influencers or spreaders in a large network, no single centrality
has consistently good ranking power. We show that statistical classifiers using
two or more centralities as input are instead consistently predictive over many
diverse, static real-world topologies. Certain pairs of centralities cooperate
particularly well in statistically drawing the boundary between the top
spreaders and the rest: local centralities measuring the size of a node's
neighbourhood benefit from the addition of a global centrality such as the
eigenvector centrality, closeness, or the core number. This is, intuitively,
because a local centrality may rank highly some nodes which are located in
dense, but peripheral regions of the network---a situation in which an
additional global centrality indicator can help by prioritising nodes located
more centrally. The nodes selected as superspreaders will usually jointly
maximise the values of both centralities. As a result of the interplay between
centrality indicators, training classifiers with seven classical indicators
leads to a nearly maximal average precision score (0.995) across the networks in this study.
Comment: 14 pages, 10 figures, 4 supplementary figures.
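A minimal sketch of the approach, pairing a local and a global centrality as classifier inputs. The network, the simulated spreading labels, and the top-10% cutoff are all illustrative stand-ins for the paper's setup:

```python
import random

import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

random.seed(0)
G = nx.barabasi_albert_graph(200, 3, seed=1)   # hypothetical topology

def outbreak_size(G, source, p=0.1, trials=30):
    """Mean final size of a simple cascade seeded at `source` (a cheap
    stand-in for the paper's epidemic simulations)."""
    total = 0
    for _ in range(trials):
        infected = {source}
        frontier = [source]
        while frontier:
            nxt = []
            for u in frontier:
                for v in G[u]:
                    if v not in infected and random.random() < p:
                        infected.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(infected)
    return total / trials

# Label the top 10% of nodes by simulated spreading power.
sizes = np.array([outbreak_size(G, v) for v in G])
y = (sizes >= np.quantile(sizes, 0.9)).astype(int)

# Two centralities as classifier inputs: one local, one global.
deg = nx.degree_centrality(G)
eig = nx.eigenvector_centrality(G, max_iter=1000)
X = np.array([[deg[v], eig[v]] for v in G])

clf = LogisticRegression().fit(X, y)
print(round(clf.score(X, y), 2))
```

The classifier learns a joint boundary in the two-centrality plane, which is exactly where a single centrality's ranking fails for nodes in dense but peripheral regions.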
Beyond ranking nodes: Predicting epidemic outbreak sizes by network centralities
Identifying important nodes for disease spreading is a central topic in
network epidemiology. We investigate how well the position of a node,
characterized by standard network measures, can predict its epidemiological
importance in any graph of a given number of nodes. This is in contrast to
other studies that deal with the easier prediction problem of ranking nodes by
their epidemic importance in given graphs. As a benchmark for epidemic
importance, we calculate the exact expected outbreak size given a node as the
source. We study exhaustively all graphs of a given size, so do not restrict
ourselves to certain generative models for graphs, nor to graph data sets. Due
to the large number of possible nonisomorphic graphs of a fixed size, we are
limited to 10-node graphs. We find that combinations of two or more
centralities are predictive (scores of 0.91 or higher) even for the most
difficult parameter values of the epidemic simulation. Typically, these
successful combinations include one normalized spectral centrality (such as
PageRank or Katz centrality) and one measure that is sensitive to the number of
edges in the graph.
Improved search methods for assessing Delay-Tolerant Networks vulnerability to colluding strong heterogeneous attacks
Increasingly, digital communication is routed among wireless, mobile computers over ad-hoc, unsecured communication channels. In this paper, we design two stochastic search algorithms (a greedy heuristic and an evolutionary algorithm) which automatically search for strong insider attack methods against a given ad-hoc, delay-tolerant communication protocol, and thus expose its weaknesses. To assess their performance, we apply the two algorithms to two simulated, large-scale mobile scenarios (of different route morphology) with 200 nodes having free range of movement. We investigate a choice of two standard attack strategies (dropping messages and flooding the network) and four delay-tolerant routing protocols: First Contact, Epidemic, Spray and Wait, and MaxProp. We find dramatic drops in performance: replicative protocols (Epidemic, Spray and Wait, MaxProp), formerly deemed resilient, are compromised to different degrees (delivery rates between 24% and 87%), while a forwarding protocol (First Contact) is shown to drop delivery rates to under 5%, in all cases via well-crafted attack strategies and an attacker group smaller than 10% of the total network size. Overall, we show that the two proposed methods combined constitute an effective means to discover (at design time) and raise awareness about the weaknesses and strengths of existing ad-hoc, delay-tolerant communication protocols against potential malicious cyber-attacks.
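The greedy-heuristic half of the search can be sketched generically. The damage objective below (counting node pairs disconnected when the attackers stop relaying) is a crude stand-in for a full delay-tolerant-network simulation of a message-dropping attack:

```python
import networkx as nx

def greedy_attackers(G, budget, damage):
    """Greedy heuristic sketch: repeatedly add the node whose inclusion
    in the attacker set maximises a damage objective. `damage` is a
    stand-in for a full DTN simulation (e.g. delivery-rate drop under
    message dropping); here it is any callable on a node set."""
    attackers = set()
    for _ in range(budget):
        best = max((v for v in G if v not in attackers),
                   key=lambda v: damage(attackers | {v}))
        attackers.add(best)
    return attackers

def disconnection_damage(G):
    """Toy objective: number of ordered node pairs disconnected once the
    attacker nodes stop relaying (are removed from the contact graph)."""
    def damage(attackers):
        H = G.copy()
        H.remove_nodes_from(attackers)
        n = H.number_of_nodes()
        connected = sum(len(c) * (len(c) - 1)
                        for c in nx.connected_components(H))
        return n * (n - 1) - connected
    return damage

G = nx.barbell_graph(5, 1)   # two cliques joined through one relay node
print(greedy_attackers(G, 1, disconnection_damage(G)))   # → {5}
```

Even with a budget of one, the greedy search finds the single relay node whose compromise partitions the network, the kind of high-impact insider the paper's search methods are designed to expose.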
Large-scale multi-objective influence maximisation with network downscaling
Finding the most influential nodes in a network is a computationally hard
problem with several possible applications in various kinds of network-based
problems. While several methods have been proposed for tackling the influence
maximisation (IM) problem, their runtime typically scales poorly when the
network size increases. Here, we propose an original method, based on network
downscaling, that allows a multi-objective evolutionary algorithm (MOEA) to
solve the IM problem on a reduced scale network, while preserving the relevant
properties of the original network. The downscaled solution is then upscaled to
the original network, using a mechanism based on centrality metrics such as
PageRank. Our results on eight large networks (including two with 50k
nodes) demonstrate the effectiveness of the proposed method, with a more than
10-fold runtime gain compared to the time needed on the original network, and a
further time reduction compared to CELF.
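The upscaling step can be sketched as a rank-matching mechanism on PageRank scores. This is an illustrative reading of the mechanism, not the paper's exact procedure, and the two networks below are synthetic:

```python
import networkx as nx

def upscale_seeds(small_G, big_G, small_seeds):
    """Sketch of the upscaling step: a seed found on the reduced network
    is mapped to an original-network node of matching PageRank rank,
    rescaled by the size ratio of the two networks."""
    small_rank = sorted(small_G, key=nx.pagerank(small_G).get, reverse=True)
    big_rank = sorted(big_G, key=nx.pagerank(big_G).get, reverse=True)
    scale = big_G.number_of_nodes() / small_G.number_of_nodes()
    return {big_rank[min(int(small_rank.index(s) * scale),
                         len(big_rank) - 1)]
            for s in small_seeds}

# Synthetic stand-ins for a downscaled network and the original one.
small = nx.barabasi_albert_graph(20, 2, seed=3)
big = nx.barabasi_albert_graph(200, 2, seed=3)
seeds_small = sorted(small, key=nx.pagerank(small).get, reverse=True)[:2]
print(upscale_seeds(small, big, seeds_small))
```

The expensive multi-objective search thus only ever runs on the small network; the original network is touched just once, for the cheap centrality-based mapping.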
Independent Prototype Propagation for Zero-Shot Compositionality
Humans are good at compositional zero-shot reasoning; someone who has never
seen a zebra before could nevertheless recognize one when we tell them it looks
like a horse with black and white stripes. Machine learning systems, on the
other hand, usually leverage spurious correlations in the training data, and
while such correlations can help recognize objects in context, they hurt
generalization. To be able to deal with underspecified datasets while still
leveraging contextual clues during classification, we propose ProtoProp, a
novel prototype propagation graph method. First we learn prototypical
representations of objects (e.g., zebra) that are conditionally independent
w.r.t. their attribute labels (e.g., stripes) and vice versa. Next we propagate
the independent prototypes through a compositional graph, to learn
compositional prototypes of novel attribute-object combinations that reflect
the dependencies of the target distribution. The method does not rely on any
external data, such as class hierarchy graphs or pretrained word embeddings. We
evaluate our approach on AO-CLEVr, a synthetic and strongly visual dataset
with clean labels, and UT-Zappos, a noisy real-world dataset of fine-grained
shoe types. We show that in the generalized compositional zero-shot setting we
outperform state-of-the-art results, and through ablations we show the
importance of each part of the method and their contribution to the final
results.
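The propagation idea can be illustrated with a deliberately tiny example: one-hot vectors stand in for the learned, conditionally independent prototypes, and a single averaging step stands in for the learned propagation over the compositional graph:

```python
import numpy as np

# One-hot stand-ins for learned prototypes; in ProtoProp these are
# conditionally independent learned representations.
prototypes = {
    "zebra":   np.array([1.0, 0.0, 0.0, 0.0]),
    "horse":   np.array([0.0, 1.0, 0.0, 0.0]),
    "striped": np.array([0.0, 0.0, 1.0, 0.0]),
}

def compose(attr, obj):
    # One-hop propagation sketch: the composition node for an unseen
    # (attribute, object) pair averages its two prototype neighbours.
    return (prototypes[attr] + prototypes[obj]) / 2

striped_horse = compose("striped", "horse")

# Classify a query embedding against seen and composed prototypes.
query = np.array([0.1, 0.5, 0.5, 0.0])   # an embedding of a striped horse
candidates = {"zebra": prototypes["zebra"], "striped horse": striped_horse}
pred = max(candidates, key=lambda k: float(candidates[k] @ query))
print(pred)   # → striped horse
```

The composed prototype lets the classifier recognise an attribute-object combination that never occurred in training, which is the compositional zero-shot setting the abstract describes.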
Automated fault tree learning from continuous-valued sensor data: a case study on domestic heaters
Many industrial sectors have been collecting big sensor data. With recent
technologies for processing big data, companies can exploit this for automatic
failure detection and prevention. We propose the first completely automated
method for failure analysis, machine-learning fault trees from raw
observational data with continuous variables. Our method scales well and is
tested on a real-world, five-year dataset of domestic heater operations in The
Netherlands, with 31 million unique heater-day readings, each containing 27
sensor and 11 failure variables. Our method builds on two previous procedures:
the C4.5 decision-tree learning algorithm, and the LIFT fault tree learning
algorithm from Boolean data. C4.5 pre-processes each continuous variable: it
learns an optimal numerical threshold which distinguishes between faulty and
normal operation of the top-level system. These thresholds discretise the
variables, thus allowing LIFT to learn fault trees which model the root failure
mechanisms of the system and are explainable. We obtain fault trees for the 11
failure variables, and evaluate them in two ways: quantitatively, with a
significance score, and qualitatively, with domain specialists. Some of the
fault trees learnt have almost maximum significance (above 0.95), while others
have medium-to-low significance (around 0.30), reflecting the difficulty of
learning from big, noisy, real-world sensor data. The domain specialists
confirm that the fault trees model meaningful relationships among the
variables.
Comment: Preprint submitted to the International Journal of Prognostics and Health Management, March 202
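The threshold-learning step can be sketched with a depth-1 decision tree (sklearn's CART standing in for C4.5); the sensor readings and failure labels below are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for one continuous sensor variable: readings on normal
# days versus readings on (rarer) failure days.
rng = np.random.default_rng(0)
normal_temp = rng.normal(60, 5, 500)    # e.g. boiler temperature, normal
faulty_temp = rng.normal(85, 5, 50)     # temperature on failure days
X = np.concatenate([normal_temp, faulty_temp]).reshape(-1, 1)
y = np.concatenate([np.zeros(500), np.ones(50)])

# A depth-1 tree (stump) learns the single numerical threshold that
# best separates faulty from normal operation.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
threshold = stump.tree_.threshold[0]
print(round(threshold, 1))
```

The resulting Boolean event ("temperature above the learned threshold") is the discretised variable that a Boolean fault tree learner such as LIFT can then consume.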